On Checkpoint Latency

Author

Nitin H. Vaidya

Abstract

Checkpointing and rollback is a technique for minimizing the loss of computation in the presence of failures. Two metrics can be used to characterize a checkpointing scheme: (i) checkpoint overhead (the increase in the execution time of the application because of a checkpoint), and (ii) checkpoint latency (the time required to save the checkpoint). For many checkpointing methods, checkpoint latency is larger than checkpoint overhead. This paper evaluates the expression for the "average overhead" of a checkpointing scheme as a function of checkpoint latency and overhead. It is shown that the "average overhead" is much more sensitive to changes in checkpoint overhead than to changes in checkpoint latency. Also, for equidistant checkpoints, the optimal checkpoint interval is shown to be independent of the checkpoint latency.
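The two claims in the abstract can be illustrated with a minimal sketch. The abstract does not state the model or the expression, so everything below is an assumption: failures are taken to be Poisson with rate `lam`, a failure costs a fixed recovery time `R` plus re-execution of the `L - C` units of work done while the checkpoint was being saved, and each interval contains `T` units of useful work. Under those assumptions the expected time per interval is `exp(lam*(L - C + R)) * (exp(lam*(T + C)) - 1) / lam`, and the average overhead is that value divided by `T`, minus 1. The names `overhead_ratio` and `optimal_interval` are hypothetical.

```python
import math


def overhead_ratio(T, C, L, R, lam):
    """Average overhead r = (expected interval time) / T - 1.

    Assumed model (not taken from the abstract): Poisson failures with
    rate lam, checkpoint overhead C, checkpoint latency L >= C, recovery
    time R, and T units of useful computation per interval.
    """
    expected = math.exp(lam * (L - C + R)) * math.expm1(lam * (T + C)) / lam
    return expected / T - 1.0


def optimal_interval(C, lam):
    """Interval T that minimizes overhead_ratio under the assumed model.

    Setting dr/dT = 0 gives exp(lam*(T + C)) * (1 - lam*T) = 1, which
    contains neither L nor R. The left side minus 1 is positive at T = 0
    (for C > 0) and negative at T = 1/lam, so bisection applies.
    """
    def f(T):
        return math.exp(lam * (T + C)) * (1.0 - lam * T) - 1.0

    lo, hi = 0.0, 1.0 / lam
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if f(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    lam, C, R = 1e-5, 10.0, 10.0  # illustrative values only
    T_opt = optimal_interval(C, lam)
    for L in (10.0, 100.0, 500.0):
        print(L, T_opt, overhead_ratio(T_opt, C, L, R, lam))
```

In this sketch `optimal_interval` takes no latency argument at all, which mirrors the independence result for equidistant checkpoints. Latency enters `overhead_ratio` only through the factor `exp(lam*(L - C + R))`, which changes slowly when `lam` is small, whereas `C` also appears inside the per-interval term.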

Similar resources

Impact of Checkpoint Latency on the Optimal Checkpoint Interval and Execution Time

The massive scale of current and next-generation massively parallel processing (MPP) systems presents significant challenges related to fault tolerance. In particular, the standard approach to fault tolerance, application-directed checkpointing, puts an incredible strain on the storage system and the interconnection network. This results in overheads on the application that severely impact perf...

Design and Evaluation of a Low-Latency Checkpointing Scheme for Mobile Computing Systems

Fault-tolerant mobile computing systems have different requirements and restrictions, not taken into account by conventional distributed systems. This paper presents a coordinated checkpointing scheme which reduces the delay involved in a global checkpointing process for mobile systems. A piggyback technique is used to track and record the checkpoint dependency information among processes durin...

On the Viability of Checkpoint Compression for Extreme Scale Fault Tolerance

The increasing size and complexity of high performance computing (HPC) systems have led to major concerns over fault frequencies and the mechanisms necessary to tolerate these faults. Previous studies have shown that state-of-the-field checkpoint/restart mechanisms will not scale sufficiently for future generation systems. In this work, we explore the feasibility of checkpoint data compression...

Another Two-Level Failure Recovery Scheme: Performance

This report deals with the design and evaluation of a "two-level" failure recovery scheme for distributed systems. In our previous work [30, 32], we motivated a "two-level" recovery approach that tolerates the more probable failures with a low overhead, and less probable failures with possibly higher overhead. The two-level approach can achieve a smaller overhead as compared to traditional recov...

A Low-Latency DMR Architecture with Efficient Recovering Scheme Exploiting Simultaneously Copiable SRAM

This paper presents a novel architecture for a fault-tolerant high-performance system using a checkpoint/restart approach with dual modular redundancy (DMR). The proposed architecture can perform low-latency copy with instantaneously copiable SRAM. Furthermore, we can use an instantaneous comparison scheme that has more fault coverage than comparison with a cyclic redundancy check (CRC). Evalua...

Publication date: 1995
